Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

This is my course diary for the Open Data Science course 2022.

Link to my GitHub repository: https://github.com/emselina/IODS-project.git

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Sun Dec  4 21:22:02 2022"

Assignment 1.

Link to my GitHub repository (Elina GitHub IODS): https://github.com/emselina/IODS-project.git

Just write some of your thoughts about this course freely in the file, e.g., How are you feeling right now? What do you expect to learn? Where did you hear about the course?

I feel curious about the course and am interested in learning new things. I expect to learn more about R and RStudio. I have never used GitHub before, so I would like to learn to deposit my own R code there in the future. I would also like to learn the core themes of open data science and how to use them in my own research.

Also reflect on your learning experiences with the R for Health Data Science book and the Exercise Set 1: How did it work as a “crash course” on modern R tools and using RStudio? Which were your favorite topics? Which topics were most difficult? Some other comments on the book and our new approach of getting started with R Markdown etc.?

I think the R for Health Data Science book contained a lot of useful information about R and RStudio, along with good examples.

My favorite topics in the book were the section on the different types of plots and the R basics. I had not used the pipe operator %>% before, so that was useful information. I believe the most difficult parts will be the data analysis: linear regression, logistic regression, etc., which I have not done before.
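As a note to myself on the pipe: x %>% f() is the same as f(x), so each step feeds its result into the next call. A minimal sketch (the numbers are just made up for illustration):

```r
library(magrittr)  # provides %>% (dplyr attaches it too)

# each step passes its result as the first argument of the next call
result <- c(1, 4, 9) %>%
  sqrt() %>%   # 1 2 3
  sum()        # 6
result
```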

I also created a Personal Access Token (PAT), although everything worked even before creating it.


Assignment 2

Describe the work you have done this week and summarize your learning.

date()
## [1] "Sun Dec  4 21:22:02 2022"

Here we go again…

Data wrangling

The data wrangling part and the R script created are in the data folder: https://github.com/emselina/IODS-project/blob/master/data/create_learning2014.R

Data analysis

1. Read the students2014 data into R either from your local folder (if you completed the Data wrangling part) or from this url: https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt . (The separator is a comma “,” and the file includes a header). Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. There is information related to the data here. (0-2 points)

I did the data wrangling part so I will use the data that I created:

#students14 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=TRUE)

students14 <- read.csv("data/learning2014.csv", sep=",", header=TRUE)
dim(students14)
## [1] 166   7
str(students14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The students2014 dataset has 166 rows and 7 columns, i.e., 166 observations; students who obtained 0 exam points were excluded. The data come from an international survey of Approaches to Learning. The survey asked basic details about age and gender, and the questions were divided into deep (deep), strategic (stra) and surface (surf) questions. The dataset also reports the points from the course exam and the global attitude towards statistics. Attitude and the questions are measured on a Likert scale (1-5). So there are 7 variables in this dataset: gender, Age, attitude, deep, stra, surf and points. The attitude variable was created in the data wrangling part by dividing the original attitude score by 10 to put it on the same scale as the others. The deep, stra and surf variables were created by taking the mean of the questions of each type.

2. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

summary(students14)
##     gender               Age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
p <- ggpairs(students14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

From the summary of the data, the minimum Age is 17 and the maximum 55, with a mean of 25.5. Attitude ranges from 1.4 to 5.0 with a mean of 3.14; deep from 1.58 to 4.92 with a mean of 3.68; stra from 1.25 to 5.0 with a mean of 3.12; and surf from 1.58 to 4.33 with a mean of 2.79. For points the minimum is 7, the maximum 33 and the mean 22.7.

The variable gender is categorical, so it cannot be used as a numeric variable in the correlation computations; in the plots it is used as a grouping variable instead.

From the graphical overview it can be seen that Age has a positively skewed distribution. Attitude looks roughly normally distributed, but there are some differences between genders: the females are skewed to the right and the males to the left. The variable deep shows a negatively skewed distribution and surf a slightly positively skewed one.

There is a statistically significant positive correlation between points and attitude in both females and males. The variable with the second highest absolute correlation with points is stra, and the third highest is surf.

There is also a negative correlation between surf and deep (figure below), but it is significant only in the male participants. There are also negative correlations between surf and attitude and between surf and stra.

surf_deep <- ggplot(students14, aes(x = surf, y = deep)) +
  labs(title="Plot of surf and deep questions",x="surf", y = "deep")
surf_deep <- surf_deep + geom_point(colour = "blue") + theme_classic() + geom_smooth(method = "lm")
surf_deep
## `geom_smooth()` using formula 'y ~ x'

3. Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)

Multiple regression

# create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + surf + stra, data = students14)

my_model %>% 
  summary()
## 
## Call:
## lm(formula = points ~ attitude + surf + stra, data = students14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## surf         -0.5861     0.8014  -0.731  0.46563    
## stra          0.8531     0.5416   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The Residuals:

A residual measures how far a point is vertically from the regression line; simply, it is the error between a predicted value and the observed actual value. In this case the median residual is 0.5156, quite close to 0; it is positive, meaning that slightly more than half of the residuals lie above the regression line.
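The definition can be checked directly in R; a minimal sketch using the built-in mtcars data as a stand-in for the course data:

```r
# residual = observed value - fitted (predicted) value
fit <- lm(mpg ~ wt, data = mtcars)             # mtcars stands in for students14
manual <- mtcars$mpg - fitted(fit)             # observed minus predicted
all.equal(unname(manual), unname(resid(fit)))  # TRUE: the same thing
median(resid(fit))                             # the median residual reported by summary()
```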

Intercept: the predicted exam score is 11.0 when attitude and the scores for the strategic and surface questions are all 0.

Std. Error: shows the standard deviation of the coefficient estimate. The standard error is used to create confidence intervals.

t-value: the t-value measures the size of the estimate relative to the variation in the sample data. Put another way, t is the estimated coefficient expressed in units of standard error. The greater the magnitude of t, the stronger the evidence against the null hypothesis that the coefficient is zero; the closer t is to 0, the weaker that evidence. Compared to the strategic and surface variables, the t-value of attitude is far from 0, indicating that its coefficient is not zero.
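These quantities hang together as t = Estimate / Std. Error, and the standard error also gives the confidence intervals; a sketch on the built-in mtcars data (not the course data):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # mtcars stands in for students14
co <- summary(fit)$coefficients      # columns: Estimate, Std. Error, t value, Pr(>|t|)
t_manual <- co[, "Estimate"] / co[, "Std. Error"]
all.equal(t_manual, co[, "t value"]) # TRUE: t is the estimate in units of SE
confint(fit)                         # 95% CIs built from the standard errors
```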

Multiple R-squared: this is known as the coefficient of determination. It is the proportion of the variance in the response variable that can be explained by the explanatory variables. In this model, about 21% of the variation in the exam scores is explained by attitude, surf and stra together.

To conclude, the only explanatory variable with a significant relationship with the target variable points is attitude, so I decided to drop the other two explanatory variables, surf and stra. This is based on the Pr(>|t|) values: for surf and stra these are above 0.05, so they have no significant relationship with points, whereas the Pr(>|t|) value of attitude is 1.93e-08, indicating a significant positive relationship with points.

Fit the model without explanatory variables that do not have a statistically significant relationship with the target variable.

library(GGally)
library(ggplot2)
qplot(points, attitude, data = students14) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# create a regression model with only statistically significant explanatory variables
my_model2 <- lm(points ~ attitude, data = students14)

my_model2 %>% 
  summary()
## 
## Call:
## lm(formula = points ~ attitude, data = students14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

4. Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)

my_model2 %>% 
  summary()
## 
## Call:
## lm(formula = points ~ attitude, data = students14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The linear regression coefficients describe the mathematical relationship between each independent variable and the dependent variable. The p values for the coefficients indicate whether these relationships are statistically significant.

The Residuals:

A residual measures how far a point is vertically from the regression line; simply, it is the error between a predicted value and the observed actual value. In this case the median residual is 0.4339, quite close to 0; it is positive, meaning that slightly more than half of the residuals lie above the regression line.

Coefficients:

The p-value tells whether there is a significant relationship between the explanatory variable and the target variable. The p-value for attitude in my_model2 is 4.12e-09, so attitude has a significant relationship with the points received from the exam.

Intercept: This means the average exam score is 11.6 when the attitude towards statistics learning is 0.

Std. Error: shows the standard deviation of the coefficient estimate. The standard error is used to create confidence intervals.

t-value: according to the t-value we can reject the null hypothesis that the slope for attitude is equal to zero.

Multiple R-squared: this is known as the coefficient of determination. It is the proportion of the variance in the response variable that can be explained by the explanatory variables. In this model, about 19% of the variation in the exam scores is explained by attitude.
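The R-squared can be reproduced by hand from its definition; a sketch on the built-in mtcars data (not the course data):

```r
fit <- lm(mpg ~ wt, data = mtcars)                # stand-in for my_model2
ss_res <- sum(resid(fit)^2)                       # variation left unexplained
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation in the response
r2 <- 1 - ss_res / ss_tot
all.equal(r2, summary(fit)$r.squared)             # TRUE
```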

5. Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots. (0-3 points)

par(mfrow = c(2,2))

plot(my_model2, c(1,2,5))

Residuals vs Fitted: The plot is used to detect non-linearity, unequal error variances, and outliers.

The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.

The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.

No single residual "stands out" from the basic random pattern of residuals. This suggests that there are no severe outliers. In my_model2 a few residuals are somewhat large, but they do not clearly break the pattern.

Normal Q-Q

The Q-Q plot, or quantile-quantile plot, is a graphical tool to help assess whether a set of data plausibly came from some theoretical distribution such as a normal or exponential distribution. This plot of my_model2 suggests the residuals are approximately normally distributed; only a few points deviate from the straight line.

Residuals vs Leverage

Residuals vs Leverage is a type of diagnostic plot that allows us to identify influential observations in a regression model. Each observation from the dataset is shown as a single point within the plot. The x-axis shows the leverage of each point and the y-axis shows the standardized residual of each point. If any point in this plot falls outside of Cook's distance, it is considered an influential observation. In my_model2 no observations lie outside Cook's distance, so there are no influential points in our regression model.
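The distances can also be inspected numerically with cooks.distance(); the 4/n cutoff below is a commonly used rule of thumb, not a fixed standard. A sketch on the built-in mtcars data (not the course data):

```r
fit <- lm(mpg ~ wt, data = mtcars)   # stand-in for my_model2
cd <- cooks.distance(fit)            # one influence value per observation
# 4/n is a common (but not universal) threshold for flagging influential points
which(cd > 4 / nrow(mtcars))
```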

Another, easier way to produce diagnostic plots:

library(ggfortify)
autoplot(my_model2)


Assignment 3

Data wrangling

I did the data wrangling and the R script is found here: https://github.com/emselina/IODS-project/blob/master/create_alc.R

Data Analysis

The joined dataset used in the analysis exercise combines the two student alcohol consumption datasets. The following adjustments have been made:

The variables not used for joining the two datasets have been combined by averaging (including the grade variables).

'alc_use' is the average of 'Dalc' and 'Walc'.

'high_use' is TRUE if 'alc_use' is higher than 2 and FALSE otherwise.
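The two derived columns can be sketched with dplyr; the toy data frame below is made up for illustration, and only the Dalc/Walc column names and the rules come from the description above:

```r
library(dplyr)

# toy rows standing in for the joined data; Dalc/Walc are 1-5 alcohol-use scores
toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))
toy <- toy %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # average of workday and weekend use
         high_use = alc_use > 2)        # TRUE when the average exceeds 2
toy  # alc_use: 1.0 3.5 5.0; high_use: FALSE TRUE TRUE
```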

2.

#alc <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=TRUE)

alc <- read.csv("data/alc.csv", sep=",", header=TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc)
## [1] 370  35
str(alc)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

This dataset is combined from the two student alcohol consumption datasets. The data attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets were provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). Only the students who answered the questionnaire in both the math and Portuguese classes were kept. The dataset has 370 observations and 35 variables, including character, numeric, logical and integer variables. Duplicated answers from the same student were combined in the dataset.

3. The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption. (0-1 point)

library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

From the dataset I chose the variables sex, failures, absences, goout and final grade (G3). My hypothesis is that alcohol consumption is higher in men and that there is a statistically significant correlation between alcohol consumption and failures (number of past class failures), absences (number of school absences), going out with friends (goout) and final grade (G3). For example, the lower the final grade, the higher the alcohol consumption, and students who go out more with their friends consume more alcohol. I also hypothesize that more class failures and school absences are associated with higher alcohol consumption.

4. Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses. (0-5 points)

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = (mean(G3)), class_failures = failures, absences = absences, going_out = goout)
## `summarise()` has grouped output by 'sex', 'high_use'. You can override using
## the `.groups` argument.
## # A tibble: 370 × 7
## # Groups:   sex, high_use [4]
##    sex   high_use count mean_grade class_failures absences going_out
##    <chr> <lgl>    <int>      <dbl>          <int>    <int>     <int>
##  1 F     FALSE      154       11.4              0        5         4
##  2 F     FALSE      154       11.4              0        3         3
##  3 F     FALSE      154       11.4              0        1         2
##  4 F     FALSE      154       11.4              0        2         2
##  5 F     FALSE      154       11.4              0        4         4
##  6 F     FALSE      154       11.4              0        1         3
##  7 F     FALSE      154       11.4              0        2         2
##  8 F     FALSE      154       11.4              0        5         4
##  9 F     FALSE      154       11.4              0        8         3
## 10 F     FALSE      154       11.4              0        3         2
## # … with 360 more rows

Sex, grades and alcohol consumption

# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col=sex))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade") + ggtitle("Student grades by alcohol consumption and sex")

Sex, absences and alcohol consumption

# initialize a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col=sex))

# define the plot as a box plot and draw it
g2 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

Sex, going out with friends and alcohol consumption

# initialize a plot of high_use and absences
g3 <- ggplot(alc, aes(x = high_use, y = goout, col=sex))

# define the plot as a box plot and draw it
g3 + geom_boxplot() + ylab("going out with friends") + ggtitle("Student amount of time going out with friends by alcohol consumption and sex")

Male students with a lower final grade from the course had higher alcohol consumption. In females the same kind of relationship was not seen.

A higher number of school absences was associated with higher alcohol consumption only in male students.

Alcohol consumption was higher in both men and women who went out more with their friends. There was not much variation between men and women: alcohol consumption increased in both groups as the time spent with friends increased.

The hypothesis that school absences and a lower final grade are connected to alcohol consumption was right, and the effect was seen in the male group. Time spent going out with friends also correlated with alcohol consumption, but this was seen in women as well, so my hypothesis that the effect would be specific to men did not hold for this variable.

5. Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable. Present and interpret a summary of the fitted model. Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis. Hint: If your model includes factor variables see for example the RHDS book or the first answer of this stackexchange thread on how R treats and how you should interpret these variables in the model output (or use some other resource to study this). (0-5 points)

# find the model with glm()
m <- glm(high_use ~ failures + absences + goout + sex, data = alc, family = "binomial")

# print out a summary of the model
m %>% 
  summary()
## 
## Call:
## glm(formula = high_use ~ failures + absences + goout + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1521  -0.7929  -0.5317   0.7412   2.3706  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.14389    0.47881  -8.654  < 2e-16 ***
## failures     0.48932    0.23073   2.121 0.033938 *  
## absences     0.08246    0.02266   3.639 0.000274 ***
## goout        0.69801    0.12093   5.772 7.83e-09 ***
## sexM         0.97989    0.26156   3.746 0.000179 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 369.50  on 365  degrees of freedom
## AIC: 379.5
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model

m %>%
coef()
## (Intercept)    failures    absences       goout        sexM 
## -4.14389330  0.48932427  0.08245946  0.69801208  0.97988506

The variables absences, goout and sexM (males) had p-values lower than 0.001, so these variables have a significant relationship with the target variable high/low alcohol consumption. The variable failures had a p-value below 0.05 but above 0.01, so its relationship is also significant, although weaker than for the other variables.

# compute odds ratios (OR)
OR <- coef(m) %>% exp 

# compute confidence intervals (CI)
CI <- confint(m) 
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR       2.5 %     97.5 %
## (Intercept) 0.01586098 -5.12422446 -3.2428932
## failures    1.63121360  0.04208372  0.9518235
## absences    1.08595465  0.03922013  0.1293995
## goout       2.00975350  0.46657743  0.9418090
## sexM        2.66415000  0.47316089  1.5009085
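One caveat about the table above: confint() returns intervals on the log-odds scale, so the OR column is exponentiated while the CI columns are not. Exponentiating the intervals puts everything on the odds-ratio scale; a self-contained sketch using a made-up binomial model on mtcars (am ~ wt is an arbitrary stand-in, not the course model):

```r
m_toy <- glm(am ~ wt, data = mtcars, family = "binomial")
OR <- exp(coef(m_toy))      # odds ratios
CI <- exp(confint(m_toy))   # exponentiate the log-odds intervals too
cbind(OR, CI)               # now the OR and its CI are on the same scale
```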

6. Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy. (0-3 points)

The variables absences, goout and sex had the strongest statistical relationship with high/low alcohol consumption, so I included them in the model.

# fit the model
m <- glm(high_use ~ absences + goout + sex, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probabilities > 0.5)


# see the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, sex, goout, high_use, probability, prediction) %>% tail(10)
##     absences sex goout high_use probability prediction
## 361        3   M     3    FALSE  0.32716276      FALSE
## 362        0   M     2    FALSE  0.15403200      FALSE
## 363        7   M     3     TRUE  0.40566343      FALSE
## 364        1   F     3    FALSE  0.12866204      FALSE
## 365        6   F     3    FALSE  0.18408147      FALSE
## 366        2   F     2    FALSE  0.07202493      FALSE
## 367        2   F     4    FALSE  0.24971608      FALSE
## 368        3   F     1    FALSE  0.03919794      FALSE
## 369        4   M     5     TRUE  0.69415157       TRUE
## 370        2   M     1     TRUE  0.09434569      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   242   17
##    TRUE     61   50

Graphical visualization:

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>% 
  prop.table()%>%
  addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65405405 0.04594595 0.70000000
##    TRUE  0.16486486 0.13513514 0.30000000
##    Sum   0.81891892 0.18108108 1.00000000

Training error

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2108108

The training error shows that the average proportion of wrong predictions of the model is about 0.21. This is better than a simple guessing strategy: predicting FALSE for everyone (prob = 0 above) gives an error of 0.3, and predicting TRUE for everyone gives 0.7.

7. Bonus: Perform 10-fold cross-validation on your model. Does your model have better test set performance (smaller prediction error using 10-fold cross-validation) compared to the model introduced in the Exercise Set (which had about 0.26 error). Could you find such a model? (0-2 points to compensate any loss of points from the above exercises)

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2108108
# leave-one-out cross-validation (K = number of observations)
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2189189
# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K=10)

# average number of wrong predictions in the 10-fold cross validation
cv$delta[1]
## [1] 0.2135135

My model's 10-fold cross-validation gave a prediction error of about 0.21, which is smaller than the error of the model introduced in the Exercise Set (about 0.26).


Assignment 4

Data wrangling

The create_human.R script can be found here: https://github.com/emselina/IODS-project/blob/master/create_human.R

Data analysis

Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here. (0-1 points)

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data(Boston)

#Explore the structure and the dimensions of the data
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The Boston data frame has 506 rows and 14 columns.

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts.

Details of the dataset:

crim : per capita crime rate by town.

zn : proportion of residential land zoned for lots over 25,000 sq.ft.

indus : proportion of non-retail business acres per town.

chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox : nitrogen oxides concentration (parts per 10 million).

rm : average number of rooms per dwelling.

age : proportion of owner-occupied units built prior to 1940.

dis : weighted mean of distances to five Boston employment centres.

rad : index of accessibility to radial highways.

tax : full-value property-tax rate per $10,000.

ptratio : pupil-teacher ratio by town.

black : 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town.

lstat : lower status of the population (percent).

medv : median value of owner-occupied homes in $1000s.

3. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
pairs(Boston)

I think this looks ugly and I cannot see much, but maybe it gives some kind of overview. Let's try something else.

p1 <- ggplot(Boston) + 
  geom_histogram(aes(x = crim), fill = "#E69F00") +
labs(title="per capita crime rate by town")

p2 <- ggplot(Boston) + 
    geom_histogram(aes(x = zn), fill = "#56B4E9") +
  labs(title="proportion of residential land zoned for lots over 25,000 sq.ft.")

p3 <- ggplot(Boston) + 
    geom_histogram(aes(x = indus), fill = "#009E73")+
  labs(title="proportion of non-retail business acres per town")

p4 <- ggplot(Boston) + 
    geom_histogram(aes(x = chas), fill = "#F0E442")+
  labs(title="Charles River dummy variable")

p5 <- ggplot(Boston) + 
    geom_histogram(aes(x = nox), fill = "#0072B2")+
  labs(title="nitrogen oxides concentration")

p6 <- ggplot(Boston) + 
  geom_histogram(aes(x = rm), fill = "#D55E00")+
  labs(title="average number of rooms per dwelling.")

p7 <- ggplot(Boston) + 
  geom_histogram(aes(x = age), fill = "wheat1")+
  labs(title="proportion of owner-occupied units built prior to 1940")


p8 <- ggplot(Boston) + 
  geom_histogram(aes(x = dis), fill = "green")+
  labs(title="weighted mean of distances to five Boston employment centres")

p9 <- ggplot(Boston) + 
  geom_histogram(aes(x = rad ), fill = "mediumorchid2")+
  labs(title="index of accessibility to radial highways")

p10 <- ggplot(Boston) + 
  geom_histogram(aes(x = tax  ), fill = "grey")+
  labs(title="full-value property-tax rate per $10,000")

p11 <- ggplot(Boston) + 
  geom_histogram(aes(x = ptratio  ), fill = "pink4")+
  labs(title="pupil-teacher ratio by town")

p12 <- ggplot(Boston) + 
  geom_histogram(aes(x = black  ), fill = "black")+
  labs(title="1000(Bk−0.63)2 where Bk is the proportion of blacks by town.")

p13 <- ggplot(Boston) + 
  geom_histogram(aes(x = lstat  ), fill = "seagreen2")+
  labs(title="lower status of the population")


p14 <- ggplot(Boston) + 
  geom_histogram(aes(x = medv  ), fill = "coral")+
  labs(title="median value of owner-occupied homes in $1000s.")
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
gridExtra::grid.arrange(p1, p2, p3, p4, nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(gridExtra)

gridExtra::grid.arrange(p5, p6, p7, p8, nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(gridExtra)

gridExtra::grid.arrange(p9, p10, p11, nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(gridExtra)

gridExtra::grid.arrange(p12, p13, p14, nrow=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d",  tl.cex = 0.6)

This correlation plot shows the correlations between the variables: the bigger and more colourful the circle, the stronger the correlation. Red indicates negative correlation and blue positive correlation. The highest positive correlation is between the tax and rad variables (0.91). The strongest negative correlation is between the nox and dis variables (-0.77), and the second strongest is between dis and age (-0.75).
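Instead of reading the strongest pairs off the plot, they can be extracted from the correlation matrix programmatically. A self-contained sketch (it recomputes the matrix from the Boston data):

```r
library(MASS)

cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA              # keep each variable pair only once
cor_long <- na.omit(as.data.frame(as.table(cm)))  # long format: Var1, Var2, Freq

# five strongest correlations by absolute value
head(cor_long[order(-abs(cor_long$Freq)), ], 5)
```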

4. Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)

# center and standardize variables
boston_scaled <- scale(Boston) 

# summaries of the scaled variables

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
boston_scaled$crim <- as.numeric(boston_scaled$crim)
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- c(quantile(boston_scaled$crim))
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels =c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
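As a quick sanity check of how cut() with quantile break points behaves, here is a tiny self-contained example on the numbers 1 to 100: each class gets roughly a quarter of the observations, just like the crime classes above.

```r
x <- 1:100
q <- quantile(x)   # 0%, 25%, 50%, 75%, 100% quantiles as break points

# include.lowest = TRUE keeps the minimum value in the first class
cls <- cut(x, breaks = q, include.lowest = TRUE,
           labels = c("low", "med_low", "med_high", "high"))
table(cls)   # every class gets 25 of the 100 values
```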
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ✔ purrr   0.3.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ gridExtra::combine() masks dplyr::combine()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ MASS::select()       masks dplyr::select()
boston_scaled <- boston_scaled %>% 
mutate(crime = crime)

# remove the original (scaled) crim variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]
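One thing worth noting: sample() draws a different random split on every knit, so the results below change each time the document is rendered. A sketch of a reproducible 80/20 split (the seed value 2022 is an arbitrary choice), shown on the raw Boston data so that it runs on its own:

```r
library(MASS)

set.seed(2022)                            # any fixed seed makes the split repeatable
n <- nrow(Boston)                         # 506 rows
ind <- sample(n, size = floor(n * 0.8))   # indices of the training rows
train_demo <- Boston[ind, ]
test_demo  <- Boston[-ind, ]
c(nrow(train_demo), nrow(test_demo))
## [1] 404 102
```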

5. Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2376238 0.2500000 0.2673267 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       0.93831633 -0.9458551 -0.15302300 -0.8640260  0.40187692 -0.8493141
## med_low  -0.05266254 -0.2980300 -0.02626030 -0.5883292 -0.13193086 -0.3983992
## med_high -0.39214634  0.2381533  0.19544522  0.4418110  0.01840258  0.4441152
## high     -0.48724019  1.0169921 -0.05360128  1.0262205 -0.36803556  0.8150158
##                 dis        rad        tax     ptratio       black      lstat
## low       0.8492601 -0.7057751 -0.7396074 -0.41150557  0.37639840 -0.7678360
## med_low   0.4001456 -0.5368402 -0.4558693 -0.08002138  0.34737407 -0.1502520
## med_high -0.3843529 -0.4781376 -0.3345594 -0.29136120  0.05546957  0.0809301
## high     -0.8587967  1.6393984  1.5149640  0.78225547 -0.85624014  0.8649756
##                 medv
## low       0.48989089
## med_low   0.01070377
## med_high  0.13239915
## high     -0.68302809
## 
## Coefficients of linear discriminants:
##                  LD1         LD2        LD3
## zn       0.066542193  0.59690683 -1.0721829
## indus    0.141843592 -0.29416753  0.5086185
## chas     0.002818211 -0.06115168  0.1271707
## nox      0.301632344 -0.81615488 -1.3102904
## rm      -0.025062188 -0.06833233 -0.1354757
## age      0.051144498 -0.32401891 -0.2635587
## dis     -0.071188977 -0.31365022  0.3493796
## rad      4.039203659  0.89676529 -0.1244605
## tax      0.077796708  0.07954045  0.6351783
## ptratio  0.127585824 -0.05727966 -0.4042162
## black   -0.078129778  0.06113708  0.1919681
## lstat    0.237233844 -0.31569212  0.4411876
## medv     0.095523454 -0.47277135 -0.1237880
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9638 0.0269 0.0093
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)


# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

The LDA separated the high crime class from the low and medium crime classes. The most influential variable is rad, the index of accessibility to radial highways. There is also some separation between the low and medium crime classes, driven mostly by zn (proportion of residential land zoned for lots over 25,000 sq.ft.) and nox (nitrogen oxides concentration, parts per 10 million). According to the proportion of trace, LD1 captures about 96.4 % of the between-group variance, LD2 about 2.7 % and LD3 about 0.9 %.
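The "proportion of trace" figures can also be recomputed by hand from the singular values stored in the fitted object: each proportion is svd^2 / sum(svd^2). Since the train split above is random, a sketch on the built-in iris data:

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)

# between-group variance captured by each discriminant;
# matches the "Proportion of trace" line printed by lda()
round(fit$svd^2 / sum(fit$svd^2), 4)
```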

6. Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes , predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       17      10        1    0
##   med_low    6      19        5    0
##   med_high   0       6       15    4
##   high       0       0        0   19

The model predicted the high crime category well: all 19 true high-crime observations in the test set were classified as high. The low crime category was predicted less accurately.
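The cross-tabulation can be summarized into a single accuracy figure: the correct predictions sit on the diagonal. A self-contained sketch using the counts printed above (they come from one particular random split, so the numbers will differ between knits):

```r
# confusion matrix copied from the printed output above (rows = correct classes)
tab <- matrix(c(17, 10,  1,  0,
                 6, 19,  5,  0,
                 0,  6, 15,  4,
                 0,  0,  0, 19),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(tab)) / sum(tab)  # proportion classified correctly
c(accuracy = accuracy, error = 1 - accuracy)
```

Here that gives an accuracy of 70/102, about 0.69, i.e. a misclassification rate of about 0.31.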

7. Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)

# access the MASS package
library(MASS)

# load the data
data(Boston)

# center and standardize variables
boston_scaled <- scale(Boston) 

# euclidean distance matrix (computed on the scaled data, so that the
# variables contribute on comparable scales)
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)

# manhattan distance matrix
dist_man <- dist(boston_scaled, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares (on the scaled data)
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering on the scaled data
km <- kmeans(boston_scaled, centers = 4)

# plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

The optimal number of clusters is where the total WCSS drops radically; according to the plot this happens at about 2 clusters.
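The elbow can also be checked numerically: the share of the remaining WCSS removed by each additional cluster drops sharply after the elbow. A sketch on the scaled Boston data (nstart = 10 makes k-means less sensitive to the random initial centers):

```r
library(MASS)

set.seed(123)
scaled <- scale(Boston)
twcss <- sapply(1:10, function(k) kmeans(scaled, k, nstart = 10)$tot.withinss)

# share of the remaining WCSS removed by each extra cluster
round(-diff(twcss) / head(twcss, -1), 2)
```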

set.seed(123)

# k-means clustering with two clusters
km <- kmeans(boston_scaled, centers = 2)
# plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

# plot the first seven variables with clusters
pairs(boston_scaled[, 1:7], col = km$cluster)

# plot the remaining variables with clusters
pairs(boston_scaled[, 8:14], col = km$cluster)

The two clusters are clearly separated in some of the variables.

Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)

# load the data
data(Boston)

# center and standardize variables
boston_scaled_2 <- scale(Boston) 

# change the object to data frame
boston_scaled_2 <- as.data.frame(boston_scaled_2)

# k-means clustering
km <- kmeans(boston_scaled_2, centers = 4)

boston_scaled_2 <- boston_scaled_2 %>% 
  mutate(clusters = km$cluster)

# linear discriminant analysis
lda.km <- lda(clusters ~ ., data = boston_scaled_2)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric

classes <- as.numeric(boston_scaled_2$clusters)

# plot the lda results
plot(lda.km, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.km, myscale = 3)

The four clusters have separated well. The variables zn, rad, tax, nox, age and chas affect the separation the most.

Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.

#install.packages("plotly")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? (0-3 points to compensate any loss of points from the above exercises)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster[ind])

The first plot shows that the data points form clearly separated groups. In the second plot, coloured by the crime classes, the high crime rate class has separated from the other crime levels. In the third plot, coloured by the k-means clusters, the fourth cluster is clearly separated from the other clusters, just like in plot 2.


Assignment 5.

Data wrangling

The data wrangling part and the R script created are in the data folder in: https://github.com/emselina/IODS-project/blob/master/create_human.R

Data analysis

I completed the data wrangling part, but just in case I used the given dataset:

human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", sep =",", header = T)

1. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
library(GGally)
color_correlation <- function(data, mapping, method="p", use="pairwise", ...){
    # Function by user20650 on Stackoverflow (https://stackoverflow.com/a/53685979)
    # grab data
    x <- eval_data_col(data, mapping$x)
    y <- eval_data_col(data, mapping$y)

    # calculate correlation
    corr <- cor(x, y, method=method, use=use)

    # calculate colour based on correlation value
    # Here I have set a correlation of minus one to blue, 
    # zero to white, and one to red 
    colFn <- colorRampPalette(c("blue", "white", "red"), interpolate ='spline')
    fill <- colFn(100)[findInterval(corr, seq(-1, 1, length=100))]

    ggally_cor(data = data, mapping = mapping, ...) + 
      theme_void() +
      theme(panel.background = element_rect(fill=fill))
  }


ggpairs(
  data = human,
  upper = list(continuous = color_correlation),
  lower = list(continuous = wrap("points", alpha = 0.3, size=0.3)),
)

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
library(corrplot)
cor_matrix <- cor(human) 
corrplot(cor_matrix, method="circle")

The human dataset has 155 observations and 8 variables. The Edu2.FM variable is the ratio of females to males with secondary education; its distribution is slightly negatively skewed. Labo.FM is the ratio of female to male labour force participation rates and is also slightly negatively skewed. Life.Exp (life expectancy at birth) has a slightly negatively skewed distribution, while Edu.Exp (expected years of education) is close to normally distributed. GNI (gross national income per capita), Mat.Mor (maternal mortality ratio), Ado.Birth (adolescent birth rate) and Parli.F (percentage of female representatives in parliament) have positively skewed distributions.

There is a statistically significant positive correlation between Life.Exp and Edu.Exp, and between Life.Exp and Edu2.FM. There is also a strong, statistically significant positive correlation between Ado.Birth and Mat.Mor, as well as between GNI and each of Edu2.FM, Edu.Exp and Life.Exp. Statistically significant negative correlations are found between Mat.Mor and each of Life.Exp, Edu.Exp, Edu2.FM and GNI. Ado.Birth also correlates negatively with Edu2.FM, Edu.Exp, Life.Exp and GNI.
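The "statistically significant" labels come from the stars that ggally_cor prints; the underlying test can be run explicitly with cor.test(). A sketch on the built-in trees data, since the human file is read from the web; the same call works on any two columns of human.

```r
# Pearson correlation with a significance test
ct <- cor.test(trees$Girth, trees$Volume)
unname(ct$estimate)   # correlation coefficient
ct$p.value < 0.05     # TRUE means the correlation is statistically significant
```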

2. Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

3. Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis to the standardized human data
pca_human <- prcomp(human_std)

# create and print out a summary of pca_human
s <- summary(pca_human)


# rounded percentanges of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# print out the percentages of variance

pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")


# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

The non-standardized and standardized PCA plots are different. The PCA on the unstandardized variables is dominated by the variables with the largest variances (here GNI, which is on a much larger scale than the other variables), so the results are essentially meaningless. That is why the analysis is done on the standardized variables, using the scale() function, which centers and scales the columns of a numeric matrix.
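To see why scaling matters, here is a small self-contained illustration using simulated data (not the human data): two uncorrelated variables on very different scales. Without scaling, PC1 is essentially just the large-variance variable; after scaling, the variance is shared between the components.

```r
# toy example: two uncorrelated variables on very different scales
set.seed(1)
toy <- data.frame(small = rnorm(100),             # sd about 1
                  big   = rnorm(100, sd = 1000))  # sd about 1000

# PCA without scaling: PC1 captures nearly all of the variance,
# because it is dominated by the 'big' column
round(summary(prcomp(toy))$importance[2, ], 3)

# PCA on standardized data: the variance is split much more evenly
round(summary(prcomp(scale(toy)))$importance[2, ], 3)
```

This is the same effect seen above with the human data, just in miniature.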

4. Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)

PC1 explains 53.6 % of the variance and PC2 explains 16.2 % of the variance. Many African developing countries are clearly clustered together on the right of the PCA plot, and the variables that mostly explain this cluster are Mat.Mor (maternal mortality ratio) and Ado.Birth (adolescent birth rate), so higher maternal mortality and adolescent birth rates cluster these countries together. Many European and other developed countries form another cluster on the left, mostly explained by Edu.Exp, Edu2.FM, GNI and Life.Exp: longer expected education, a higher ratio of females to males with at least secondary education, higher Gross National Income and higher life expectancy. In addition, the Nordic countries, including Finland, form a cluster in the upper left. This cluster is mostly explained by Parli.F and Labo.FM, i.e. a higher share of females in parliament and a higher female-to-male labour force participation ratio.

5. The tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what their perception of the product is (12 questions). In addition, some personal details were asked (4 questions).

Load the tea dataset and convert its character variables to factors:

Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data.

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
View(tea)

Visualization of the data:

Variables 1 to 12:

library(ggplot2)
library(dplyr)
library(tidyr)
pivot_longer(tea[1:12], cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Variables 13 to 18:

library(ggplot2)
library(dplyr)
library(tidyr)
pivot_longer(tea[13:18], cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Variable age:

library(ggplot2)
library(dplyr)
library(tidyr)
pivot_longer(tea[19], cols = everything()) %>% 
  ggplot(aes(value)) + 
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))+ xlab("age")

Variables 20 to 25:

library(ggplot2)
library(dplyr)
library(tidyr)
pivot_longer(tea[20:25], cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Variables 26 to 32:

library(ggplot2)
library(dplyr)
library(tidyr)
pivot_longer(tea[26:32], cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 0.2, size = 7))

6. Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, all_of(keep_columns))

# look at the summaries and structure of the data

summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)

# summary of the model

summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

Dimension 1 of the MCA explains 15.24 % of the variance and dimension 2 explains 14.23 % of the variance.
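
These percentages can also be read directly from the eigenvalue table stored in the MCA result (the $eig component of a FactoMineR MCA object). A minimal sketch, rebuilding the same tea_time columns as above:

```r
library(FactoMineR)

# rebuild the same MCA as in the chunk above
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)

# percentage of variance for the first two dimensions
round(mca$eig[1:2, "percentage of variance"], 2)
```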

Milk, Earl Grey tea and sugar clearly cluster together, so people who drink Earl Grey often take it with milk and sugar. Tea shop and unpackaged tea cluster together; unpackaged tea is typically sold in tea shops, which explains this cluster. Likewise, tea bag and chain store cluster together, so people mostly buy tea bags from chain stores. Black tea and no sugar cluster together, so with black tea people do not use sugar that often.

Another way to plot the MCA:

library(FactoMineR)
plotellipses(mca)


(more chapters to be added similarly as we proceed with the course!)